Description of ARMORD data

### Study data 
These files are as provided in the supplementary data file. Processed versions of them, in the Model Data folder, are used in the linear models.

1) AMR_genes_ARIBA_CARD.csv
Output of the ARIBA AMR gene local assembler report for each sequenced sample (run using the CARD resistance database). One row per assembled AMR gene. Further details can be found at https://github.com/sanger-pathogens/ariba/wiki/Task%3A-run
pid - participant ID 
no - sample number
samp_id	- sample ID (includes alphanumeric participant ID and sample number)
ref_name - original name of reference sequence chosen from cluster
CARD_accession_number - AMR gene accession number in CARD database
cluster - name of CARD AMR gene cluster
var_only - 1=variant only, 0=presence/absence
rpkm - reads in cluster per AMR gene kilobase per million reads
rpm - reads in cluster per million reads
reads - number of reads
nonhuman_reads - number of nonhuman reads (all subsampled to 3.5 million reads)
max_cov - number of reference nucleotides assembled by this contig divided by reference gene length (i.e. ref_base_assembled/ref_len)
pc_ident - %identity between reference sequence and contig
ref_len - reference gene length
ref_base_assembled - number of reference nucleotides assembled by this contig
ctg_len - length of contig
ctg_cov - mean mapped read depth of this contig
free_text - other free text about reference sequence, from CARD database
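
The derived columns above (max_cov, rpm, rpkm) can be sanity-checked from the raw columns. The following is a minimal sketch, not the authors' pipeline. It assumes that `reads` is the number of reads in the cluster and that "per million reads" refers to `nonhuman_reads`; neither is stated explicitly above. The rows in `EXAMPLE_CSV` are made up for illustration, and `derive_metrics` is a hypothetical helper name.

```python
import csv
import io

# Made-up rows for illustration only; real values come from AMR_genes_ARIBA_CARD.csv.
EXAMPLE_CSV = """samp_id,cluster,reads,nonhuman_reads,ref_len,ref_base_assembled
X1,geneA,700,3500000,1000,900
X1,geneB,35,3500000,500,500
"""


def derive_metrics(row):
    """Recompute max_cov, rpm and rpkm for one row (assumed formulas)."""
    reads = float(row["reads"])
    total_reads = float(row["nonhuman_reads"])  # assumed denominator
    ref_len = float(row["ref_len"])
    assembled = float(row["ref_base_assembled"])
    rpm = reads / (total_reads / 1e6)  # reads per million reads
    return {
        "max_cov": assembled / ref_len,  # ref_base_assembled / ref_len
        "rpm": rpm,
        "rpkm": rpm / (ref_len / 1000),  # additionally per kilobase of gene
    }


rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
metrics = [derive_metrics(r) for r in rows]
for row, m in zip(rows, metrics):
    print(row["cluster"], m)
```

To run this against the real file, replace `io.StringIO(EXAMPLE_CSV)` with an open file handle and compare the recomputed values with the stored columns.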

2) Antimicrobial_exposures.csv
Antimicrobial data from electronic records and study CRFs. Each row is a course of antimicrobials.
pid - participant ID 
drug - antimicrobial name
route - route of administration
first_sample - first stool sample from participant (used for relative date of course start/end)
drug_group - category of antimicrobial (A = aminoglycoside, B = beta_lactam_broad, C = clindamycin, F = antifolate, G = glycopeptide, M = macrolide, N = metronidazole, O = other, P = beta_lactam_narrow, Q = quinolone, T = tetracycline, U = unknown, V = antiviral, Z = antifungal)
dose_interval - dosing interval of drug
firstdose_days_after_first_sample - start date of course in days, relative to first stool sample
lastdose_days_after_first_sample - end date of course in days, relative to first stool sample
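
Because course dates and sample collection times are both expressed in days relative to the participant's first stool sample, courses can be matched to samples directly. The sketch below shows one way to list the drug groups a participant received in a window before each sample. The 14-day window, the helper name `exposed_groups`, and the example rows are all hypothetical; this is not the exposure definition used in the paper.

```python
import csv
import io

# Made-up rows for illustration only.
COURSES_CSV = """pid,drug,drug_group,firstdose_days_after_first_sample,lastdose_days_after_first_sample
1,drugA,B,-10,-3
1,drugB,Q,2,6
2,drugC,M,-40,-35
"""
SAMPLES_CSV = """pid,samp_id,collected_days_after_first_sample
1,S1,0
1,S2,5
2,S3,0
"""


def exposed_groups(courses, pid, sample_day, window_days):
    """Drug groups with any dose in [sample_day - window_days, sample_day]."""
    window_start = sample_day - window_days
    groups = set()
    for course in courses:
        if course["pid"] != pid:
            continue
        first = float(course["firstdose_days_after_first_sample"])
        last = float(course["lastdose_days_after_first_sample"])
        # The course overlaps the window if it starts on or before the sample
        # day and ends on or after the start of the window.
        if first <= sample_day and last >= window_start:
            groups.add(course["drug_group"])
    return sorted(groups)


courses = list(csv.DictReader(io.StringIO(COURSES_CSV)))
samples = list(csv.DictReader(io.StringIO(SAMPLES_CSV)))

WINDOW_DAYS = 14  # hypothetical look-back window, not taken from the paper
results = {
    s["samp_id"]: exposed_groups(
        courses, s["pid"], float(s["collected_days_after_first_sample"]), WINDOW_DAYS
    )
    for s in samples
}
print(results)
```

Real data may contain missing dates or values that need cleaning before the `float` conversion, which this sketch does not handle.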

3) card.obo 
CARD AMR ontology/database file (for reference - was used to create AMR_genes_ARIBA_CARD.csv output but not needed to reproduce plots)

4) patients.csv
List of ARMORD participants with one row per person
pid - participant ID (numeric). Note that an alphanumeric system was also used, in which CS = cross-sectional and LS = longitudinal, both starting from 1; this was replaced with a purely numeric system in which LS patients start from 500.
age_category - ordinal age (5 year groups)
age_group - 5-year age group at enrolment
sex - sex (1 = male, 0 = female)
category - Medical = medical in/outpatient, Healthy = healthy volunteer, Haem_autograft = haematology patient having autologous SCT, Heam_allograft = haematology patient having allogeneic SCT

5) Samples.csv
List of sequenced samples. Not all collected samples were sequenced, so sample numbering for participants is not necessarily consecutive
pid - participant ID 
sample_order - sample number for that participant, starting at 1 for first sample
samp_id - sample ID (combining alphanumeric participant ID and sample order)
collected_days_after_first_sample - time of collection relative to first sample (days)
bristol_score - sample consistency on the Bristol stool scale
filtered_reads - number of reads filtered during QC
human_reads - number of human reads in original sequence data (removed prior to analysis & removed from sequence data in public repository)
nonhuman_reads - number of nonhuman reads (subsampled to 3500000)
total_reads - total number of reads

6) Taxa_Metaphlan.csv
Taxonomic profile from Metaphlan
samp_id - sample ID
kingdom/phylum/class/order/family/genus/species - taxa from metaphlan output
perc - percentage abundance of taxon
seq_run - sequence run
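
As an illustration of how the long-format taxonomic table can be summarised per sample, the sketch below computes a Shannon index from species-level percentage abundances. It assumes that species-level rows are those with a non-empty `species` value and that higher-rank rows leave it blank, which is not stated above. It is not necessarily the index or method behind c_diversity_indices_mp.csv; the example rows and the helper name `shannon_by_sample` are made up.

```python
import csv
import io
import math
from collections import defaultdict

# Made-up rows for illustration only; the third row mimics a genus-level entry.
TAXA_CSV = """samp_id,genus,species,perc
S1,GenusA,speciesA,50
S1,GenusA,speciesB,50
S1,GenusA,,100
S2,GenusB,speciesC,100
"""


def shannon_by_sample(rows):
    """Shannon index H = -sum(p * ln p) over species-level rows of each sample."""
    per_sample = defaultdict(list)
    for row in rows:
        perc = float(row["perc"])
        # Keep only species-level rows with non-zero abundance (assumed layout).
        if row["species"] and perc > 0:
            per_sample[row["samp_id"]].append(perc)
    result = {}
    for samp_id, percs in per_sample.items():
        total = sum(percs)  # renormalise so the proportions sum to 1
        result[samp_id] = sum(-(p / total) * math.log(p / total) for p in percs)
    return result


rows = list(csv.DictReader(io.StringIO(TAXA_CSV)))
shannon = shannon_by_sample(rows)
print(shannon)
```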


### Model data
The following processed input data are used in linear models. Files starting 'c_' are for cross-sectional models, and those starting 'l_' are for longitudinal models.

1) b_first_samples.csv
List of first samples used for the cross-sectional model
2) c_argRA.csv
Log relative abundance of major categories of AMR genes from ARIBA output
3) c_bugRA.csv
Log relative abundance of major taxa from metaphlan output
4) c_cat_charlson.csv
Charlson co-morbidity score for each patient
5) c_cat_max_news_pre_sample.csv
Maximum track and trigger score recorded for each patient prior to sample collection
6) c_conditioning.csv
Days after conditioning started (with or without truncation)
7) c_crp.csv
High CRP recorded prior to sample collection (1 = yes)
8) c_diversity_indices_mp.csv
Diversity indices for samples
9) c_patients.csv
List of patients
10) c_wcc.csv
High or low WCC recorded prior to sample collection
11) l_argRA.csv
Log relative abundance of major categories of AMR genes from ARIBA output
12) l_bugRA.csv
Log relative abundance of major taxa from metaphlan output
13) l_crp.csv
Change in CRP between samples (categorised as new high CRP, 1 = yes)
14) l_diversity_indices.csv
Change in diversity between pairs
15) l_news.csv
Change in track and trigger score between pairs
16) l_pairs.csv
Details of sample pairs and the time in days separating them
17) l_patients.csv
Patients in the longitudinal model
18) l_wcc.csv
New high or low WCC between samples
19) number_of_first_samples_with_each_AM_class_exposure.csv
20) number_of_first_samples_with_each_AM_drug_exposure.csv
21) number_of_pairs_with_each_AM_class_exposure.csv
22) number_of_pairs_with_each_AM_drug_exposure.csv
'number_of_[...]' files list the number of patients with each exposure in the cross-sectional and longitudinal models, for plotting
23) table_of_pairs_with_AM_class_exposures.csv
24) table_of_pairs_with_AM_drug_exposures.csv
25) table_of_samples_with_AM_class_exposures.csv
26) table_of_samples_with_AM_drug_exposures.csv
'table_of_[...]' files list exposures to each antimicrobial drug or class (calculated as described in the paper)